Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix deadlock on zombie_threads_mtx #234

Merged
merged 6 commits into from
Apr 6, 2017

Conversation

rafalcieslak
Copy link
Contributor

During thread_exit, the mutex on zombie_threads must be unlocked before critical section is left. Otherwise we risk preemption before the mutex is unlocked, and as the thread is TDS_INACTIVE, it will never resume execution - permanently blocking any other threads from locking on the mutex.

I've observed this problem several times on my local machine, but I've only managed to trigger it with QEMU, and thus I cannot provide a seed for reproducing the issue.

@rafalcieslak
Copy link
Contributor Author

rafalcieslak commented Mar 27, 2017

Oh, this just popped in in a master Travis build. The test thread_join is unable to thread_create, because it gets stuck in thread_reap where it is unable to lock on the zombie_threads mutex - it is owned by fd_test thread, which was preempted before it unlocking the mutex.

EDIT: And the same happened in this build for #232! So there are two PRs each fixing one issue, and each fails tests because of the other bug. I have a feeling we are going in the right direction.

@cahirwpz
Copy link
Owner

Locking strategy of code that starts in sys/thread.c:133 looks like utter mess. A brief look at NetBSD's proc and lwp (FreeBSD's proc and thread reveal similar structure) shows that they use at least one mutex (l_mutex, p_lock) for whole task structure and additional conditional variable (l_waitcv, p_waitcv) to aid exit and join calls implementation.

Please add td_lock and td_waitcv to thread structure and provide cleaner solution to fix the bug.

sys/thread.c Outdated
@@ -70,6 +70,9 @@ thread_t *thread_create(const char *name, void (*fn)(void *), void *arg) {
td->td_kstack.stk_base = (void *)PG_VADDR_START(td->td_kstack_obj);
td->td_kstack.stk_size = PAGESIZE;

mtx_init(&td->td_lock, MTX_DEF);
cv_init(&td->td_waitcv, "wait");
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"td_waitcv" for better diagnostics

Copy link
Owner

@cahirwpz cahirwpz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Close, but not there yet.

sys/thread.c Outdated
{
td->td_exitcode = exitcode;
td->td_state = TDS_INACTIVE;
cv_signal(&td->td_waitcv);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cv_broadcast to wake up all thread that awaits td exit?

@@ -131,16 +136,18 @@ noreturn void thread_exit(int exitcode) {
fdtab_release(td->td_fdtable);

mtx_lock(&zombie_threads_mtx);
TAILQ_INSERT_TAIL(&zombie_threads, td, td_zombieq);
mtx_unlock(&zombie_threads_mtx);

critical_enter();
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like we don't need critical_enter here as long as scheduler respects td_lock mutex. Please have a look at preferred solution kern_thread.c:488. You'll need to add sched_abandon function that removes a thread from scheduler and makes sure it will not ever be returned to.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I understand the solution you've linked to. The entire function is marked with a comment:

Always called with scheduler locked.

which suggests that there is an external mechanism which prevents the scheduler from running while a thread is exiting. The thread mutex gets locked only at the very last moment, which also hints at a different synchronization mechanism besides this mutex.

Also, I do not understand what do you mean by:

as long as scheduler respects td_lock mutex.

Do we want to prevent the scheduler from switching context to threads whose mutex is locked? Should the scheduler lock (or, actually, try-lock) the thread mutex when picking next thread? For this to work well w/o obvious deadlock, we would also have to prevent preemption of threads who own their own thread mutex. I'm under the impression that this could lead to complex problems.

I'd appreciate some detailed description of the idea you want me to implement.

include/thread.h Outdated
@@ -25,6 +27,9 @@ typedef struct thread {
TAILQ_ENTRY(thread) td_zombieq; /* a link on zombie queue */
char *td_name;
tid_t td_tid;
/* Locks*/
mtx_t td_lock;
condvar_t td_waitcv;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please move these to the top of the structure and describe the purpose of td_waitcv.

This was referenced Apr 6, 2017
@cahirwpz
Copy link
Owner

cahirwpz commented Apr 6, 2017

Ok. I believe there's no simple and clean way to fix the problem tackled by this PR. While I consider the change to be hack, we cannot wait any longer for it being integrated.

@rafalcieslak please look at issue #242, I'll put more information there soon.

@cahirwpz cahirwpz merged commit 66d8f3a into cahirwpz:master Apr 6, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants